1. INTRODUCTION


In the applied forecasting literature much attention has been lavished on questions about the
evaluation of probability forecasts, and the subjectivist view of probability has been invoked to
aggregate probability forecasts over a diverse set of events or statements (e.g. see Fischhoff and
McGregor, 1982). One criterion invoked in such evaluations is that of calibration: a set of
statements or events is considered and we ask if x percent of those assigned probability x of
being correct prove to be correct, for each value of x. From this perspective, weather
forecasters generally have been found to perform well (Murphy and Winkler, 1974, 1977).
What is especially helpful in the evaluation of such probability forecasters is that they make
forecasts about a long sequence of events (e.g. rain on a given day), and thus it makes sense
to think about probability functions associated with the forecasts. In this paper we focus on a
criterion for comparing forecasters, refinement, which goes beyond that of calibration (see the
related discussion in Winkler, 1982).

The formal setting we consider is the same as that presented in DeGroot and Fienberg (1982,
1983). Consider two forecasters who at the beginning of each period n in a sequential process
(n = 1, 2, ...) must independently specify their subjective probabilities that a particular event
$A_n$ will occur during the period. Assume that each forecaster in specifying the probability of
$A_n$ is aware of the values of various variables that are potentially relevant to the occurrence
of $A_n$, including which of the previous events $A_1, A_2, \ldots, A_{n-1}$ have actually
occurred. We wish to compare the two forecasters on the basis of their subjective probabilities
of the events $A_1, A_2, \ldots, A_n$ and the subsequent observation of exactly which of those
events occurred, for large values of n.

It is possible to think of our forecasters as economists who at the beginning of each period
must specify their probabilities that the value of a particular commodity will rise during the
period, or as medical diagnosticians who specify their probabilities that a patient has a
particular disorder (e.g. see Habbema, Hilden, and Bjerregaard, 1978, and Hilden, Habbema and
Bjerregaard, 1978a, 1978b). As in DeGroot and Fienberg (1982, 1983), however, we present the
basic results of the first four sections of this paper in the context of two weather forecasters.
Day after day, they must specify their subjective probabilities, x and y, that there will be at
least a certain amount of rain at some given location during a specified time period in the
day, and we refer to the occurrence of this well-specified event as "rain." We make the
assumption that x and y are restricted to a given finite set of values
$0 = x_0 < x_1 < \ldots < x_k = 1$, and we let X denote the set $\{x_0, x_1, \ldots, x_k\}$.

In Section 2, we present the basic notation and formal definitions of the concepts of
calibration and refinement. Then we go on to describe the relationship between these
concepts and the classical concept of sufficiency, and we summarize various results on when
one well-calibrated forecaster is at least as refined as another given in DeGroot and Fienberg
(1982, 1983) and DeGroot and Eriksson (1983). In Section 3, we present two concepts
introduced by Vardeman and Meeden (1983), semi-calibration and rain (or dry)-domination,
which impose different restrictions on the probability distributions of the forecasters than do
calibration and sufficiency or refinement, and we discuss how these different concepts are
related.

In Section 4, we introduce the notion of strictly proper scoring rules and describe their use
in comparing forecasters. We give a proof of a basic partitioning result for strictly proper
scoring rules, described earlier in DeGroot and Fienberg (1983). We also give a result on the
relationship between Schur-convex measures of the quality of forecasters and strictly proper
scoring rules presented in DeGroot and Eriksson (1983).

In Section 5, we turn to the multivariate or vector-probability forecasting situation in which
the events of interest have three or more possible outcomes. For example, in the weather
context the event may be tomorrow's maximum temperature and the possible outcomes may be
grouped into 5°C intervals. The forecasters are required to announce their subjective
probabilities associated with each of the $s \ge 3$ possible outcomes.

2. CALIBRATION AND REFINEMENT

We begin by considering two forecasters, A and B, and we define the joint probability
function, $g(x, y, \theta)$, for the random variables associated with the ith event $E_i$ in a
sequence, where x and y are forecaster A's and B's subjective probabilities of rain, respectively,
and $\theta = 1$ if the outcome is rain or $\theta = 0$ otherwise. The overall performance of
forecaster A can be characterized by two functions defined on X:

(i) the probability function

$\nu_A(x) = \sum_{y \in X} \sum_{\theta} g(x, y, \theta)$,

which gives the probabilities or relative frequencies with which forecaster A makes
each possible prediction x.

(ii) the conditional probability function

$\rho_A(x) = \sum_{y \in X} g(x, y, 1) / \nu_A(x)$,

which gives the conditional probability or relative frequency of rain given forecaster
A's specific prediction x.

The functions $\nu_B(x)$ and $\rho_B(x)$ for forecaster B can be defined similarly. Finally the
long run frequency of rain is

$\mu = \sum_{x \in X} \sum_{y \in X} g(x, y, 1)$.

A forecaster is said to be well-calibrated if $\rho(x) = x$ for all $x \in X$ such that
$\nu(x) > 0$. Thus a forecaster is well-calibrated if the forecaster's predictions can be accepted
at face value, i.e. given that the forecaster predicts x, the conditional probability of rain is x.
DeGroot and Eriksson (1983) give three different interpretations for the quantities $\nu(x)$ and
$\rho(x)$, based on (i) limiting or theoretical values for an infinite sequence of days, (ii) the
actual relative frequencies for a finite sequence of n days, and (iii) subjective probabilities of an
observer who is comparing the forecasters. We proceed using interpretation (iii), although we
sometimes use language that is associated with (i).

For various reasons, being well-calibrated is usually regarded as a desirable characteristic of a
forecaster. For example, Pratt (1962) and Dawid (1982) show that a probability forecaster who
is coherent in the sense of de Finetti (1937) must be calibrated almost surely. However, as
Murphy and Winkler (1977), Dawid (1982), and DeGroot and Fienberg (1983) note, even if a
forecaster is well-calibrated his predictions are not necessarily accurate in all respects nor are
they necessarily of much use to anyone. For example, the forecaster whose prediction on each
day is $\mu$ is well-calibrated but clearly useless as a forecaster once we know $\mu$.
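
The definitions of the probability function, the conditional probability function, the long run
frequency of rain, and calibration can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the joint table `g`, the helper names `summarize` and
`is_well_calibrated`, and the numbers are hypothetical and do not come from the papers cited
above. Exact fractions are used so that the test of calibration is an exact equality.

```python
from fractions import Fraction as Fr

def summarize(g, who="A"):
    """Return (nu, rho, mu) for forecaster A or B from a joint table.

    g maps (x, y, theta) -> probability, where x and y are the two
    forecasters' predictions and theta is 1 for rain, 0 otherwise.
    """
    idx = 0 if who == "A" else 1
    nu, rain = {}, {}
    for key, prob in g.items():
        pred, theta = key[idx], key[2]
        nu[pred] = nu.get(pred, 0) + prob              # how often pred is forecast
        rain[pred] = rain.get(pred, 0) + prob * theta  # mass of (pred, rain)
    rho = {x: rain[x] / nu[x] for x in nu if nu[x] > 0}  # P(rain | prediction x)
    mu = sum(rain.values())                              # long run frequency of rain
    return nu, rho, mu

def is_well_calibrated(nu, rho):
    """True if rho(x) == x for every prediction x made with positive probability."""
    return all(rho[x] == x for x in nu if nu[x] > 0)

# Hypothetical joint table: A forecasts 1/4 or 3/4, B always forecasts 1/2.
half = Fr(1, 2)
g = {
    (Fr(1, 4), half, 1): Fr(1, 8), (Fr(1, 4), half, 0): Fr(3, 8),
    (Fr(3, 4), half, 1): Fr(3, 8), (Fr(3, 4), half, 0): Fr(1, 8),
}
nu_A, rho_A, mu = summarize(g, "A")
nu_B, rho_B, _ = summarize(g, "B")
print(mu, is_well_calibrated(nu_A, rho_A), is_well_calibrated(nu_B, rho_B))
```

In this hypothetical table both forecasters are well-calibrated, yet B's constant forecast carries
no information beyond the long run frequency itself.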

Consider two well-calibrated forecasters A and B whose predictions are characterised by their
probability functions $\nu_A$ and $\nu_B$. In DeGroot and Fienberg (1982, 1983), we introduce a
concept of refinement which induces a partial ordering on the class of all well-calibrated
forecasters. This concept is defined as follows:

A stochastic transformation $h(y \mid x)$ is a non-negative function defined on $X \times X$
such that

$\sum_{y \in X} h(y \mid x) = 1$ for every $x \in X$.  (2.1)

Forecaster A is said to be at least as refined as forecaster B if there exists a stochastic
transformation $h(y \mid x)$ such that

$\sum_{x \in X} h(y \mid x) \nu_A(x) = \nu_B(y)$ for $y \in X$,  (2.2)

$\sum_{x \in X} h(y \mid x) x \nu_A(x) = y \nu_B(y)$ for $y \in X$.  (2.3)

Following DeGroot and Eriksson (1983) we denote this relationship by the symbols
$A \succeq B$. The stochastic transformation here plays the role of an auxiliary randomization
which we could use to generate predictions with distribution $\nu_B(y)$ from A's predictions, as
in (2.2). Equation (2.3) is required in the definition to ensure that the predictions generated by
this process are well-calibrated.
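
As a minimal sketch of how conditions (2.1) to (2.3) can be checked for a proposed stochastic
transformation, consider the following Python fragment. The function name
`is_refinement_witness`, the three-point grid, and the two forecasters are assumptions made for
illustration only. A result of `False` shows only that the particular h supplied fails, not that
no suitable h exists.

```python
from fractions import Fraction as Fr

def is_refinement_witness(grid, h, nu_A, nu_B):
    """Check (2.1)-(2.3) for a candidate h, stored as h[x][y] = h(y | x).

    nu_A and nu_B map every grid point to its probability; entries of h
    that are missing are treated as zero.
    """
    def hval(y, x):
        return h.get(x, {}).get(y, 0)

    # (2.1): each h(. | x) is a probability distribution over the grid.
    rows_ok = all(
        all(hval(y, x) >= 0 for y in grid) and sum(hval(y, x) for y in grid) == 1
        for x in grid
    )
    # (2.2): randomizing A's predictions through h reproduces nu_B.
    marginal_ok = all(
        sum(hval(y, x) * nu_A[x] for x in grid) == nu_B[y] for y in grid
    )
    # (2.3): the predictions generated this way remain well-calibrated.
    calibration_ok = all(
        sum(hval(y, x) * x * nu_A[x] for x in grid) == y * nu_B[y] for y in grid
    )
    return rows_ok and marginal_ok and calibration_ok

# Hypothetical example with mean 1/2: A is always right (forecasts 0 or 1),
# B always forecasts 1/2.
grid = [Fr(0), Fr(1, 2), Fr(1)]
nu_A = {Fr(0): Fr(1, 2), Fr(1, 2): Fr(0), Fr(1): Fr(1, 2)}
nu_B = {Fr(0): Fr(0), Fr(1, 2): Fr(1), Fr(1): Fr(0)}
collapse = {x: {Fr(1, 2): Fr(1)} for x in grid}   # send every forecast to 1/2
identity = {x: {x: Fr(1)} for x in grid}          # leave every forecast unchanged

print(is_refinement_witness(grid, collapse, nu_A, nu_B))
print(is_refinement_witness(grid, identity, nu_B, nu_A))
```

The first call confirms that collapsing A's forecasts to 1/2 reproduces B, so A is at least as
refined as B in this example; the second shows that the identity transformation does not turn B's
forecasts into A's.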

If $A \succeq B$ and the probability functions $\nu_A$ and $\nu_B$ are not identically equal,
then A is said to be more refined than B. We denote this relationship by the symbols
$A \succ B$.

The relationship $A \succeq B$ is both reflexive and transitive, and induces a partial ordering
(but not a total ordering) among well-calibrated forecasters. In these terms, the forecaster who
makes the same prediction $\mu$ each day is least-refined in the sense that any other
well-calibrated forecaster is at least as refined as he is. It is possible that $\mu$ is not one of
the allowable predictions $x_0, x_1, \ldots, x_k$. In that case, there is a value of i
(i = 0, 1, ..., k-1) such that $x_i < \mu < x_{i+1}$, and DeGroot and Fienberg (1982) show that a
forecaster who uses only the values $x_i$ and $x_{i+1}$ as his predictions in such a way that he
is well calibrated will now be least-refined. At the other extreme, the forecaster whose
prediction each day is either x = 0 or x = 1 and who is always correct is most-refined in the
sense that he is more refined than any other well-calibrated forecaster.

For any well-calibrated forecaster it must be true that

$\sum_{x \in X} x \nu(x) = \mu$.  (2.4)

Since every distribution with mean $\mu$ can be thought of as the $\nu(x)$ of some
well-calibrated forecaster, the comparison of well-calibrated forecasters using the relationship
$\succeq$ is equivalent to the problem of the comparison of all distributions on X with a given
mean $\mu$.

The left-hand side of expression (2.3) resembles the form of a conditional expectation. This
observation leads to the following result:

Theorem 1 (DeGroot and Eriksson, 1983): The relationship $A \succeq B$ is satisfied if and
only if there exist discrete random variables X and Y such that the marginal
probability distribution of X is $\nu_A$, the marginal probability distribution of Y is $\nu_B$,
and E(X | Y) = Y.

We temporarily step back and consider the comparison of two arbitrary forecasters A and B,
who are not necessarily well calibrated. For any given forecaster, let $f(x \mid \theta)$ denote
the conditional probability function of the forecaster's predictions given $\theta$. Thus,
$f(x \mid 1)$ can be regarded as the frequency function of the forecaster's predictions on days
when rain actually occurs, and $f(x \mid 0)$ as the frequency function on days when rain does
not occur. It follows that for $x \in X$,

$f(x \mid 1) = \rho(x) \nu(x) / \mu$,  (2.5)

$f(x \mid 0) = [1 - \rho(x)] \nu(x) / (1 - \mu)$.  (2.6)

Thus we can also use the probability function $f(x \mid \theta)$ for $\theta = 1$ and
$\theta = 0$ to characterize a forecaster's predictions. Let forecasters A and B have conditional
probability functions $f_A(x \mid \theta)$ and $f_B(y \mid \theta)$. Following Blackwell's (1951,
1952) work on the comparison of experiments we say that A is sufficient for B if there exists a
stochastic transformation $h(y \mid x)$ such that

$\sum_{x \in X} h(y \mid x) f_A(x \mid \theta) = f_B(y \mid \theta)$ for $y \in X$ and $\theta = 0, 1$.  (2.7)

Using the relationships (2.5) and (2.6) we can now prove

Theorem 2 (DeGroot and Fienberg, 1982): Consider two forecasters A and B whose
predictions are characterized by $\nu_A(x)$, $\rho_A(x)$, $\nu_B(x)$, and $\rho_B(x)$. Then A is
sufficient for B if and only if there exists a stochastic transformation h such that

$\sum_{x \in X} h(y \mid x) \nu_A(x) = \nu_B(y)$ for $y \in X$,  (2.8)

and

$\sum_{x \in X} h(y \mid x) \rho_A(x) \nu_A(x) = \rho_B(y) \nu_B(y)$ for $y \in X$.  (2.9)

Now if we again restrict attention to well calibrated forecasters we get the following corollary:

Corollary 1: Consider two well-calibrated forecasters A and B. Then $A \succeq B$ if and
only if A is sufficient for B.


Thus we can use results from the comparison of experiments to compare two well-calibrated
probability forecasters A and B. For any well-calibrated forecaster, let F denote the
distribution function corresponding to the probability function $\nu$, i.e.

$F(t) = \sum_{x \in X:\, x \le t} \nu(x)$ for $0 \le t \le 1$.  (2.10)

Theorem 3 (DeGroot and Eriksson, 1983): The relationship $A \succeq B$ is satisfied if and
only if

$\int_0^s F_A(t)\,dt \ge \int_0^s F_B(t)\,dt$ for all $0 \le s \le 1$.  (2.11)

The relationship (2.11) is one of several equivalent definitions of second-degree stochastic
dominance (e.g. see Fishburn and Vickson, 1978), and we see that it is equivalent to the
relationship $A \succeq B$. This leads to yet another equivalence:

Theorem 4 (Hardy, Littlewood, and Polya, 1929, 1934): The relationship $A \succeq B$ is
satisfied if and only if, for every continuous, convex function c defined on the
unit interval,

$\sum_{x \in X} c(x) \nu_A(x) \ge \sum_{x \in X} c(x) \nu_B(x)$.  (2.12)

Finally, there is a simplification of the second-order stochastic dominance relationship that
can be expressed only in terms of the probability functions $\nu_A$ and $\nu_B$.

Theorem 5 (DeGroot and Fienberg, 1982): The relationship $A \succeq B$ is satisfied if and
only if

$\sum_{i=0}^{j-1} (x_j - x_i) [\nu_A(x_i) - \nu_B(x_i)] \ge 0$ for $j = 1, \ldots, k-1$.  (2.13)

If at least one inequality in (2.13) is strict, then $A \succ B$.
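
Criterion (2.13) involves only finite sums, so it is straightforward to evaluate. The following
Python sketch is one possible way to do so; the function name `at_least_as_refined`, the
five-point grid, and the two probability functions are hypothetical, and both forecasters are
assumed to be well-calibrated with the same mean.

```python
from fractions import Fraction as Fr

def at_least_as_refined(xs, nu_A, nu_B):
    """Evaluate criterion (2.13).

    xs is the increasing grid x_0, ..., x_k; nu_A and nu_B are lists aligned
    with xs. Returns (all inequalities hold, at least one is strict).
    """
    k = len(xs) - 1
    gaps = [
        sum((xs[j] - xs[i]) * (nu_A[i] - nu_B[i]) for i in range(j))
        for j in range(1, k)            # j = 1, ..., k-1
    ]
    return all(gap >= 0 for gap in gaps), any(gap > 0 for gap in gaps)

# Hypothetical forecasters on the grid 0, 1/4, 1/2, 3/4, 1, both with mean 1/2.
xs = [Fr(i, 4) for i in range(5)]
nu_A = [Fr(1, 4), Fr(0), Fr(1, 2), Fr(0), Fr(1, 4)]   # uses 0, 1/2 and 1
nu_B = [Fr(0), Fr(1, 2), Fr(0), Fr(1, 2), Fr(0)]      # uses 1/4 and 3/4

print(at_least_as_refined(xs, nu_A, nu_B))
print(at_least_as_refined(xs, nu_B, nu_A))
```

For these numbers the three sums in (2.13) are 1/16, 0, and 1/16, so A is more refined than B,
while the reverse comparison fails.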

By a direct application of this theorem, we can gain insight into the refinement relationship
through the following Corollary:

Corollary 2: Suppose A and B are well calibrated forecasters such that

$\nu_B(x_i) \ge \nu_A(x_i)$, $i = 1, \ldots, k-1$,  (2.14)

and $\nu_B(x_0) = \nu_B(x_k) = 0$. Then $A \succeq B$. If one of the inequalities in (2.14) is
strict, then $A \succ B$.

Thus we see that, in a rough sense, A is more refined than B if A spreads his probabilities
into the extremes (in this case to 0 and 1) more than B.

As we have seen from (2.7), in order to determine whether A is sufficient for B, we need
only know the marginal probability functions $f_A(x \mid \theta)$ and $f_B(x \mid \theta)$ of A
and B separately. However, in order to determine whether forecaster A is sufficient for both
himself and forecaster B together, we must know the joint probability function
$\nu(x, y) = g(x, y, 1) + g(x, y, 0)$ of the predictions x and y of A and B, as well as the
conditional probability of rain given the two predictions x and y:

$\rho(x, y) = g(x, y, 1) / \nu(x, y)$.  (2.15)

Then we can use the following result:

Theorem 6 (DeGroot and Fienberg, 1983): Forecaster A is sufficient for the pair of
forecasters (A, B) if and only if

$\rho(x, y) = \rho_A(x)$ for $x \in X$ and $y \in X$.  (2.16)

It follows that, if A is sufficient for (A, B), then A is sufficient for B. The converse, however,
is not necessarily true.

Suppose now that neither A nor B is sufficient for the other. It becomes natural to ask if
we can do better than A or B by using only their predictions. One way to try to do this is
to choose A's prediction with probability $\alpha$ and B's with probability $1 - \alpha$. This
results in a "new forecaster" whom we label as $M(\alpha)$. If both A and B are well-calibrated
then it is straightforward to show that $M(\alpha)$ is also well-calibrated. Furthermore, it
follows directly from Theorem 3 that $M(\alpha) \succeq A$ if and only if $B \succeq A$, and that
$M(\alpha) \succeq B$ if and only if $A \succeq B$. Thus, randomly mixing two well-calibrated
forecasters does not allow us to improve upon either of them in the refinement sense.
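
The behaviour of the randomly mixed forecaster can be illustrated numerically. In the sketch
below, which is only an illustration, the probability function of the mixture is taken to be the
corresponding mixture of the two probability functions, and criterion (2.11) is evaluated at the
grid points, which suffices because the integral of a step distribution function is piecewise
linear with breakpoints on the grid. The function names and the numerical examples are
hypothetical.

```python
from fractions import Fraction as Fr

def integrated_cdf(xs, nu):
    """Integral of F from 0 to each grid point x_j, for a step distribution on xs."""
    return [sum((xs[j] - xs[i]) * nu[i] for i in range(j)) for j in range(len(xs))]

def refines(xs, nu_A, nu_B):
    """Criterion (2.11) checked at the grid points."""
    return all(a >= b for a, b in zip(integrated_cdf(xs, nu_A), integrated_cdf(xs, nu_B)))

def mix(alpha, nu_A, nu_B):
    """Probability function of the forecaster who uses A with probability alpha."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(nu_A, nu_B)]

xs = [Fr(i, 4) for i in range(5)]
nu_B = [Fr(0), Fr(1, 2), Fr(0), Fr(1, 2), Fr(0)]

# Case 1: A and B are not comparable, so the mixture refines neither of them.
nu_A = [Fr(1, 5), Fr(0), Fr(3, 5), Fr(0), Fr(1, 5)]
nu_M = mix(Fr(1, 2), nu_A, nu_B)
print(refines(xs, nu_M, nu_A), refines(xs, nu_M, nu_B))

# Case 2: A2 is more refined than B, so the mixture refines B but not A2.
nu_A2 = [Fr(1, 4), Fr(0), Fr(1, 2), Fr(0), Fr(1, 4)]
nu_M2 = mix(Fr(1, 2), nu_A2, nu_B)
print(refines(xs, nu_M2, nu_A2), refines(xs, nu_M2, nu_B))
```

In both cases the mixture is at least as refined as one of its components exactly when the other
component already is.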

The second way to use the predictions of two forecasters is to average them (or more
generally take linear combinations). Unfortunately, if A and B are well-calibrated it does not
necessarily follow that the average of A's and B's predictions will be well-calibrated. This is
most easily seen if we average the predictions of the most-refined well-calibrated forecaster,
who always correctly forecasts 0 or 1, and the predictions of the least-refined well-calibrated
forecaster, who always forecasts $\mu$. Thus if we wish to improve upon A and B by averaging
their predictions, there is the potential loss of calibration. Moreover, even if the result of
averaging A's and B's forecasts is well-calibrated, it remains unclear whether the distribution of
the average can be more spread than A's and B's distributions.

3.  RESTRICTED  COMPARISONS 

As in many other statistical decision problems, there are essentially two different types of
error that the forecaster might make. He might predict a high probability of rain on a day
when it does not rain or he might predict a low probability of rain on a day when it does
rain. Some forecasters may control one type of error better than the other. We may thus
wish to consider how to compare two forecasters separately for days on which it rains and for
days on which it does not (i.e. dry days). This leads to two notions of dominance, introduced
by Vardeman and Meeden (1983), which are both forms of first-degree stochastic dominance
(again see Fishburn and Vickson, 1978).

We say that forecaster A rain dominates forecaster B if

$\sum_{i=0}^{j} f_A(x_i \mid 1) \le \sum_{i=0}^{j} f_B(x_i \mid 1)$ for $j = 0, 1, 2, \ldots, k$,  (3.1)

that forecaster A dry dominates forecaster B if

$\sum_{i=0}^{j} f_A(x_i \mid 0) \ge \sum_{i=0}^{j} f_B(x_i \mid 0)$ for $j = 0, 1, 2, \ldots, k$,  (3.2)

and that forecaster A dominates forecaster B if both (3.1) and (3.2) hold. Condition (3.1) says
that on rainy days, A's predictions have a distribution which is stochastically larger than the
distribution of B's predictions, and Condition (3.2) says that for dry days A's predictions are
stochastically smaller than B's predictions.
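
Conditions (3.1) and (3.2) compare cumulative sums of the conditional probability functions
(2.5) and (2.6). The Python sketch below shows one way such a comparison might be carried
out. The function names and the three hypothetical well-calibrated forecasters are assumptions
for illustration, and the long run frequency of rain is assumed to lie strictly between 0 and 1.

```python
from fractions import Fraction as Fr
from itertools import accumulate

def conditional_pfs(nu, rho):
    """Return the lists f(x_i | 1) and f(x_i | 0) given by (2.5) and (2.6)."""
    mu = sum(r * v for r, v in zip(rho, nu))   # long run frequency of rain
    f1 = [r * v / mu for r, v in zip(rho, nu)]
    f0 = [(1 - r) * v / (1 - mu) for r, v in zip(rho, nu)]
    return f1, f0

def rain_dominates(f1_A, f1_B):
    """(3.1): on rainy days A's cumulative sums never exceed B's."""
    return all(a <= b for a, b in zip(accumulate(f1_A), accumulate(f1_B)))

def dry_dominates(f0_A, f0_B):
    """(3.2): on dry days A's cumulative sums are never below B's."""
    return all(a >= b for a, b in zip(accumulate(f0_A), accumulate(f0_B)))

# Hypothetical well-calibrated forecasters (rho(x) = x) with mean 1/2.
xs = [Fr(i, 4) for i in range(5)]
nu_A = [Fr(1, 4), Fr(0), Fr(1, 2), Fr(0), Fr(1, 4)]
nu_B = [Fr(0), Fr(1, 2), Fr(0), Fr(1, 2), Fr(0)]
nu_C = [Fr(0), Fr(0), Fr(1), Fr(0), Fr(0)]            # always forecasts 1/2

f1_A, f0_A = conditional_pfs(nu_A, xs)
f1_B, f0_B = conditional_pfs(nu_B, xs)
f1_C, f0_C = conditional_pfs(nu_C, xs)

print(rain_dominates(f1_A, f1_C), dry_dominates(f0_A, f0_C))
print(rain_dominates(f1_A, f1_B), dry_dominates(f0_A, f0_B))
```

Here A dominates the constant forecaster C, but A neither rain dominates nor dry dominates B,
even though these particular $\nu_A$ and $\nu_B$ satisfy criterion (2.13).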

The relationships of rain domination, dry domination, and domination, like that of refinement
(or sufficiency), each induce their own partial ordering amongst forecasters. They also provide
alternative ways of demonstrating refinement, as the following result shows:

Theorem 7 (Vardeman and Meeden, 1983): For two well-calibrated forecasters A and
B, if either (i) A rain dominates B or (ii) A dry dominates B, then $A \succeq B$.

The relationship of domination is quite stringent and thus it does not need to be combined
with a condition as strong as calibration to imply sufficiency. Following Vardeman and
Meeden, we say that a forecaster is semi-calibrated if $\rho(x)$ is nondecreasing in x for those
values of x with $\nu(x) > 0$.

Theorem 8 (Vardeman and Meeden, 1983): If forecasters A and B are both
semi-calibrated and A dominates B, then A is sufficient for B.

Vardeman and Meeden (1983) go on to use the concepts of domination and semi-calibration,
along with calibration and refinement, to make comparisons between Bayesian forecasters who
use stationary n-step Markov chain representations for the sequence of outcomes $\theta$.

4. STRICTLY PROPER SCORING RULES

It has often been suggested in the statistical literature that a forecaster's predictions over a
sequence of days can be evaluated by the use of a scoring rule which assigns a numerical
value, or score, each day based on the forecaster's prediction x and the observation of whether
or not rain occurred, i.e. the observation of $\theta$. One property of the use of such rules, when
the forecaster attempts to maximize the expectation of this score, is that if the forecaster's
predictions are not restricted to be probabilities, then there is a known transform of the values
of x to values which are probabilities (Lindley, 1983). For the class of proper scoring rules
described below, the values of x must themselves be probabilities. There is little reason,
however, to believe that a forecaster will want to maximize his expected overall score (e.g. see
the discussion on this point in DeGroot and Fienberg, 1983, and in Stael von Holstein, 1970).

Our interest in scoring rules in the context of comparing forecasters is somewhat different.
Since we know that the relationship $A \succeq B$ induces only a partial ordering on the class
of well-calibrated forecasters, we wish to assign a measure of quality $m(\nu)$ to the probability
function $\nu$ of every well-calibrated forecaster in order to obtain a total ordering of this class.
The values $m(\nu)$ should be assigned in such a way that the better the forecaster, the higher
his measure of quality will be. It is natural to interpret this requirement to mean that if
$A \succeq B$ then $m(\nu_A) \ge m(\nu_B)$, with strict inequality unless the probability
functions $\nu_A$ and $\nu_B$ are identical. Indeed, we need not restrict attention to only
well-calibrated forecasters. It is convenient to arrive at the measure $m(\cdot)$ for the
comparison of well-calibrated forecasters from the more general approach of the use of scoring
rules which are applicable to all forecasters.

We begin by considering an arbitrary scoring rule. Suppose that the forecaster's prediction is
x and rain occurs. Then the forecaster receives a score $g_1(x)$. If rain does not occur he
receives a score $g_2(x)$. Since we assume that the forecaster desires to maximize his score, it
is reasonable to assume that $g_1(x)$ is an increasing function of x and that $g_2(x)$ is a
decreasing function of x. If the forecaster's actual subjective probability of rain on a particular
day is p and he makes the prediction x, then his expected score is

$p\,g_1(x) + (1-p)\,g_2(x)$.  (4.1)

A proper scoring rule is one for which expression (4.1) is maximized when x = p. A strictly
proper scoring rule is one for which x = p is the only value of x that maximizes expression
(4.1). An interesting discussion of these rules, with historical references, is given by Stael von
Holstein (1970, Sec. 3.2).

Examples of strictly proper scoring rules include the quadratic rule (Brier, 1950; de Finetti,
1962, 1965) with $g_1(x) = -(x-1)^2$ and $g_2(x) = -x^2$, and the logarithmic rule (Good, 1952)
with $g_1(x) = \log x$ and $g_2(x) = \log(1-x)$. Both of these examples have the symmetry
property, $g_1(x) = g_2(1-x)$, but this is not a requirement of strictly proper rules. An example
of an improper
scoring rule is the exponential rule, in which $g_1(x)$ and $g_2(x)$ are exponential functions of
x. Here the values that maximize the expected score are $x = \log\{(1-p)/p\}$, i.e. the log-odds.
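
The defining property of a proper scoring rule can be checked numerically by maximizing
expression (4.1) over a grid of candidate predictions. The Python sketch below does this for
the quadratic and logarithmic rules. For contrast it also includes a linear rule with
$g_1(x) = x$ and $g_2(x) = 1 - x$, which is not discussed in the text and is introduced here only
as an assumed example of a rule that is not proper. The function names, the candidate grid, and
the value p = 0.3 are likewise illustrative choices.

```python
import math

def expected_score(g1, g2, p, x):
    """Expression (4.1): expected score when the belief is p and the prediction is x."""
    return p * g1(x) + (1 - p) * g2(x)

def best_prediction(g1, g2, p, candidates):
    """Candidate x with the largest expected score (the first one on ties)."""
    return max(candidates, key=lambda x: expected_score(g1, g2, p, x))

quadratic = (lambda x: -(x - 1) ** 2, lambda x: -x ** 2)
logarithmic = (lambda x: math.log(x), lambda x: math.log(1 - x))
linear = (lambda x: x, lambda x: 1 - x)   # illustrative only, not from the text

candidates = [i / 100 for i in range(1, 100)]   # interior grid avoids log(0)
p = 0.3
best = {
    "quadratic": best_prediction(*quadratic, p, candidates),
    "logarithmic": best_prediction(*logarithmic, p, candidates),
    "linear": best_prediction(*linear, p, candidates),
}
print(best)
```

Under the quadratic and logarithmic rules the best prediction on the grid coincides with p,
whereas under the linear rule the forecaster does best by pushing the prediction to the smallest
available value because p is below 1/2.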

If a proper scoring rule is used for all of a forecaster's predictions then we get an overall
score S for the forecaster. Among all days on which the forecaster's prediction is x, the score
will be $g_1(x)$ with relative frequency $\rho(x)$ and $g_2(x)$ with relative frequency
$1 - \rho(x)$. Since the relative frequency of the prediction x for the forecaster is $\nu(x)$, we
have that

$S = \sum_{x \in X} \nu(x) \{\rho(x) g_1(x) + [1 - \rho(x)] g_2(x)\}$.  (4.2)

We now come to the major result of this Section, which shows that this overall score (4.2)
can be partitioned into a component which is a measure of calibration and a component which
measures sufficiency or refinement.

Theorem 9 (DeGroot and Fienberg, 1983): If a forecaster's predictions are
characterized by the functions $\nu(x)$ and $\rho(x)$, and if a proper scoring rule is
specified by the functions $g_1(x)$ and $g_2(x)$, then the forecaster's overall score S can
be expressed in the form $S = S_1 + S_2$, where

$S_1 = \sum_{x \in X} \nu(x) \{\rho(x) [g_1(x) - g_1(\rho(x))] + [1 - \rho(x)] [g_2(x) - g_2(\rho(x))]\}$,  (4.3)

$S_2 = \sum_{x \in X} \nu(x) \phi(\rho(x))$,  (4.4)

and

$\phi(t) = t\,g_1(t) + (1-t)\,g_2(t)$ for $0 \le t \le 1$.  (4.5)

If the scoring rule is strictly proper then $\phi(t)$ is strictly convex and $S_1$ attains its
maximum value only when $\rho(x) = x$ for every value of x such that $\nu(x) > 0$.

Proof: It can be verified directly that if $S_1$ and $S_2$ are given by (4.3) to (4.5) then
$S_1 + S_2$ is given by expression (4.2) for S. Thus the first part of the theorem is established.
Now suppose that the scoring rule is strictly proper. To see that $\phi(t)$ is convex, note that

$\phi(t) = \max_{0 \le x \le 1} [t\,g_1(x) + (1-t)\,g_2(x)]$.  (4.6)

In (4.6), $\phi(t)$ is represented as the maximum of a family of linear functions of t. Hence,
$\phi(t)$ is convex.

If $\phi(t)$ is not strictly convex, then it contains at least one linear segment. This means that
there must be one particular linear function that yields the maximum value in (4.6) for all the
values of t in some interval. But that is impossible, because we know that at any particular
value $t = t_0$ ($0 < t_0 < 1$) the maximum in (4.6) is attained uniquely by the linear function
$t\,g_1(t_0) + (1-t)\,g_2(t_0)$. Hence, $\phi(t)$ must be strictly convex.

Finally, since the scoring rule is strictly proper we know that

$\rho(x) g_1(x) + [1 - \rho(x)] g_2(x) \le \rho(x) g_1[\rho(x)] + [1 - \rho(x)] g_2[\rho(x)]$,  (4.7)

with strict inequality unless $x = \rho(x)$. Hence, it can be seen from (4.3) that $S_1$ will be
negative unless $\rho(x) = x$ for every value of x such that $\nu(x) > 0$, in which case $S_1$ will
be equal to zero. ■
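
The partition of the overall score can be checked numerically. The following Python sketch
computes S, $S_1$, and $S_2$ from (4.2) to (4.5) for the quadratic rule. The function name
`score_components` and the hypothetical forecaster, who is not well-calibrated, are assumptions
chosen for illustration.

```python
def score_components(xs, nu, rho, g1, g2):
    """Return (S, S1, S2) computed from (4.2), (4.3) and (4.4)."""
    def phi(t):
        return t * g1(t) + (1 - t) * g2(t)          # (4.5)

    S = S1 = S2 = 0.0
    for x, v, r in zip(xs, nu, rho):
        if v == 0:
            continue                                 # unused predictions add nothing
        S += v * (r * g1(x) + (1 - r) * g2(x))                        # (4.2)
        S1 += v * (r * (g1(x) - g1(r)) + (1 - r) * (g2(x) - g2(r)))   # (4.3)
        S2 += v * phi(r)                                              # (4.4)
    return S, S1, S2

def g1(x):
    return -(x - 1) ** 2      # quadratic rule, score when rain occurs

def g2(x):
    return -x ** 2            # quadratic rule, score when rain does not occur

# Hypothetical forecaster whose extreme forecasts are too extreme.
xs = [0.1, 0.5, 0.9]
nu = [0.3, 0.4, 0.3]
rho = [0.2, 0.5, 0.8]

S, S1, S2 = score_components(xs, nu, rho, g1, g2)
S_cal, S1_cal, S2_cal = score_components(xs, nu, xs, g1, g2)   # rho(x) = x
print(S, S1, S2)
print(S_cal, S1_cal, S2_cal)
```

For the quadratic rule the calibration component reduces to minus the weighted squared
distance between x and $\rho(x)$, so it is negative for this forecaster and zero for the
well-calibrated one.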

As we noted above, $S_1$ is a measure of the forecaster's calibration, which is zero only for a
well-calibrated forecaster and negative otherwise. The component $S_2$ provides us with our
sought-after measure to give a total ordering for well-calibrated forecasters:

$m(\nu) = \sum_{x \in X} \phi(x) \nu(x)$,  (4.8)

which is $S_2$ with $\rho(x) = x$. Since $\phi(x)$ is strictly convex, Theorem 4 implies that if
$A \succeq B$ then $m(\nu_A) \ge m(\nu_B)$, with strict inequality unless $\nu_A$ and $\nu_B$ are
identical.

DeGroot and Eriksson (1983) note that there is a direct relationship between the total
ordering provided by the measure m and the concept of Schur-convexity which plays an
important role elsewhere in statistics (Marshall and Olkin, 1979). Consider a function m
defined on the class of all probability distributions $\nu$ over X that have a given mean $\mu$.
Then m is said to be strictly Schur-convex if $m(\nu_A) \ge m(\nu_B)$ whenever the relation
(2.11) is satisfied, with strict inequality unless $\nu_A = \nu_B$. The following result now
follows directly from Theorem 4.

Theorem 10 (DeGroot and Eriksson, 1983): Consider a strictly proper scoring rule
based on the functions $g_1$ and $g_2$, and suppose that a measure of quality m is
defined by (4.8) and (4.5). Then m is strictly Schur-convex.

Suppose now that we know the functions $\nu(x)$ and $\rho(x)$ that characterize a particular
forecaster's predictions. Is it possible for us to use his predictions, and no other relevant
meteorological information, to make our own predictions and to attain a larger value of the
score S than the forecaster himself? The following argument generalizes the one given in
DeGroot and Fienberg (1983) for the quadratic scoring rule (see also Schervish, 1983).

The forecaster's score is given by expression (4.2). In order for us to make our predictions,
we must choose a stochastic transformation $h(x \mid y)$ as follows. If the forecaster's
prediction on a given day is y, then we choose our prediction at random from X in accordance
with the conditional distribution $h(x \mid y)$. With this procedure, our predictions are
characterised by the functions

$\nu_0(x) = \sum_{y \in X} h(x \mid y) \nu(y)$,  (4.9)

$\rho_0(x) = \sum_{y \in X} h(x \mid y) \rho(y) \nu(y) / \nu_0(x)$.  (4.10)

It follows from expressions (4.2), (4.9), and (4.10) after some algebra, that our score is

$S_0 = \sum_{y \in X} \nu(y) \sum_{x \in X} [\rho(y) g_1(x) + (1 - \rho(y)) g_2(x)]\, h(x \mid y)$.  (4.11)

For each fixed value of y, the summation over x in expression (4.11) yields a weighted average
of the quantities

$\rho(y) g_1(x) + \{1 - \rho(y)\} g_2(x)$,  (4.12)

with weights given by the conditional probabilities $h(x \mid y)$. Thus to maximize the weighted
average, we choose the conditional distribution $h(x \mid y)$ to put all the probability on the
value of x that maximizes expression (4.12). If $\rho(y) \in X$ the maximizing value is
$x = \rho(y)$, and we make the forecaster well-calibrated. With this choice, our value of $S_2$
remains the same as that of the original forecaster, but our value of $S_1$ is now increased to 0.
If $\rho(y) \notin X$ then we come as close to the maximum of expression (4.12) as possible, by
setting x equal to the permissible value close to $\rho(y)$ that maximizes (4.12), i.e. we make
the forecaster almost well-calibrated. Formally, a forecaster is said to be almost well-calibrated
(relative to the strictly proper scoring rule defined by $g_1$ and $g_2$) if for each point
$y \in X$ such that $\nu(y) > 0$, the expression (4.12) is maximized over the points $x \in X$
when x = y. Following Schervish (1983), if we take an arbitrary forecaster B we refer to a second
forecaster A who uses this concentrated function $h(x \mid y)$ to transform the predictions of B
into his own as "the almost calibrated version of B." Then our result is:

Theorem 11. Consider a strictly proper scoring rule. Let B be any forecaster and
let A be the almost calibrated version of B. If B is not almost well-calibrated, then
A has a strictly larger score than B.

This theorem can be viewed as providing motivation for the idea of recalibrating forecasters
suggested by Lindley, Tversky, and Brown (1979).
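
The recalibration argument can be illustrated numerically. The following is a minimal sketch, not part of the original argument: it assumes the quadratic scoring rule $g_1(x) = -(1-x)^2$, $g_2(x) = -x^2$, a permissible grid of predictions $\{0, 0.1, \ldots, 1\}$, and a hypothetical forecaster B whose functions `nu` and `rho` are invented for the example. For each prediction y of B it selects the permissible x that maximizes expression (4.12) and compares the two scores.

```python
from fractions import Fraction as F

# Quadratic scoring rule, oriented so that larger scores are better (an assumption
# of this sketch; any strictly proper rule could be substituted).
def g1(x):
    """Score received when the event occurs."""
    return -(1 - x) ** 2

def g2(x):
    """Score received when the event does not occur."""
    return -x ** 2

def expected_score(rho_y, x):
    """Expression (4.12): rho(y) g1(x) + (1 - rho(y)) g2(x)."""
    return rho_y * g1(x) + (1 - rho_y) * g2(x)

def almost_calibrate(grid, nu, rho):
    """For each y used by B, pick the permissible x that maximizes (4.12)."""
    return {y: max(grid, key=lambda x: expected_score(rho[y], x))
            for y in nu if nu[y] > 0}

def score(nu, rho, announce):
    """Overall score when B's prediction y is announced as announce[y]."""
    return sum(nu[y] * expected_score(rho[y], announce[y]) for y in nu)

# Hypothetical forecaster B on the permissible grid {0, 0.1, ..., 1}.
grid = [F(k, 10) for k in range(11)]
nu = {F(1, 5): F(1, 2), F(4, 5): F(1, 2)}       # nu(y): how often y is used
rho = {F(1, 5): F(2, 5), F(4, 5): F(67, 100)}   # rho(y): event frequency given y

announce_a = almost_calibrate(grid, nu, rho)
score_b = score(nu, rho, {y: y for y in nu})
score_a = score(nu, rho, announce_a)
print(announce_a)        # y = 1/5 is moved to 2/5, and y = 4/5 to 7/10
print(score_b, score_a)  # -259/1000 -231/1000
```

In this example $\rho(1/5) = 2/5$ lies on the grid, so that prediction becomes exactly calibrated, whereas $\rho(4/5) = 0.67$ does not, so the closest permissible value 7/10 is used; the almost calibrated version A scores strictly higher than B, as Theorem 11 asserts.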

5.  COMPARING  MULTIVARIATE  FORECASTERS

We now turn to a consideration of forecasting events with s > 2 outcomes (e.g. a set of
temperature ranges). In such settings the probability forecaster specifies a vector of
probabilities $x = (x_1, x_2, \ldots, x_s)$, restricted to a finite set $X^{(s)}$ of values lying in the
(s-1)-dimensional simplex, i.e. $x_i \ge 0$ and $\sum_i x_i = 1$. If the conditional probabilities of the s
outcomes given the prediction x are represented in vector form by
$\rho(x) = [\rho_1(x), \rho_2(x), \ldots, \rho_s(x)]$, then the multivariate forecaster is well-calibrated if
$\rho(x) = x$ for all $x \in X^{(s)}$. Note that this well-calibrated multivariate forecaster is also
well-calibrated, in the sense of Section 2, for each binary problem formed by combining the s
outcomes into two groups; however, a forecaster who is "marginally" well-calibrated for
predicting "rain" or "no rain" may no longer be well-calibrated when "rain" is divided into two
or more possible outcomes.

More formally, let $x = (x_1, \ldots, x_s)$ and $\rho(x) = [\rho_1(x), \ldots, \rho_s(x)]$. Furthermore, let
$I = \{I_1, \ldots, I_k\}$ represent a partition of the set $\{1, \ldots, s\}$ into k nonempty, mutually
exclusive, and exhaustive sets $I_1, \ldots, I_k$. Then a forecaster is said to be marginally
well-calibrated with respect to the partition I if

$$\sum_{i \in I_j} \rho_i(x) = \sum_{i \in I_j} x_i \quad \text{for } j = 1, \ldots, k \text{ and } x \in X^{(s)}. \qquad (5.1)$$

We can also focus on a particular set of the partition I, say $I_h$, and define

$$\rho_{i(h)}(x, I_h) = P(\theta = i \mid \theta \in I_h,\ \text{forecast } x). \qquad (5.2)$$

Then we can say that a forecaster is conditionally well-calibrated given the set $I_h$ if

$$\rho_{i(h)}(x, I_h) = \frac{x_i}{\sum_{j \in I_h} x_j}, \quad i \in I_h. \qquad (5.3)$$
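
The distinction between multivariate, marginal, and conditional calibration can be checked mechanically. The sketch below is illustrative only: the three-outcome data (no rain, light rain, heavy rain) are hypothetical, and the function names are introduced here for the example. Each forecaster is described by a dictionary mapping a prediction vector x to the vector of conditional relative frequencies $\rho(x)$, and the conditional probability in (5.2) is computed as $\rho_i(x)$ divided by the total of $\rho$ over the chosen set.

```python
from fractions import Fraction as F

def is_well_calibrated(rho):
    """Multivariate calibration: rho(x) = x for every prediction x."""
    return all(r == x for x, r in rho.items())

def is_marginally_well_calibrated(rho, partition):
    """Condition (5.1) for every block I_j of the partition."""
    return all(sum(r[i] for i in block) == sum(x[i] for i in block)
               for x, r in rho.items() for block in partition)

def is_conditionally_well_calibrated(rho, block):
    """Condition (5.3): within the block, rho_i(x) is proportional to x_i."""
    for x, r in rho.items():
        x_tot = sum(x[i] for i in block)
        r_tot = sum(r[i] for i in block)
        if x_tot == 0 or r_tot == 0:
            continue  # the conditional probabilities are undefined here
        if any(r[i] / r_tot != x[i] / x_tot for i in block):
            return False
    return True

# Outcomes indexed 0 = no rain, 1 = light rain, 2 = heavy rain (hypothetical data).
rho = {
    (F(1, 2), F(1, 4), F(1, 4)): (F(1, 2), F(2, 5), F(1, 10)),
    (F(1, 5), F(2, 5), F(2, 5)): (F(1, 5), F(1, 2), F(3, 10)),
}
rain_vs_no_rain = [(0,), (1, 2)]

print(is_marginally_well_calibrated(rho, rain_vs_no_rain))  # True
print(is_well_calibrated(rho))                              # False
print(is_conditionally_well_calibrated(rho, (1, 2)))        # False
```

This forecaster is calibrated for "rain" versus "no rain" but splits the rain probability between light and heavy rain incorrectly, so neither multivariate nor conditional calibration holds.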


Moreover, because being well-calibrated in the multivariate sense is a demanding requirement,
we might also want to know if a forecaster is well-calibrated for some but not necessarily all
values of x. Let $X_0^{(s)}$ denote a proper subset of $X^{(s)}$. Then we say that a forecaster is
partially well-calibrated on the subset $X_0^{(s)}$ if $\rho(x) = x$ for $x \in X_0^{(s)} \subset X^{(s)}$. We can now
combine these notions of partial, conditional, and marginal calibration in various ways. (In
particular, we note that the concept of conditional calibration suggested in DeGroot and
Fienberg (1982) is in fact a combination of conditional and partial calibration as defined here.)
We also consider an extension of semi-calibration, introduced in Section 3, to the multivariate
setting in a special case at the end of this section.

For well-calibrated multivariate forecasters, we can define the concept of refinement by
means of a multivariate stochastic transformation $h(x \mid y)$. Consider two well-calibrated
forecasters characterized by their probability functions $\nu_A$ and $\nu_B$. Then we say that A is at
least as refined as B if there exists a stochastic transformation h such that:

$$\sum_{x \in X_A^{(s)}} h(y \mid x)\, x\, \nu_A(x) = y\, \nu_B(y) \quad \text{for } y \in X_B^{(s)}. \qquad (5.4)$$

Note that the analogue of equation (2.2), i.e.

$$\sum_{x \in X_A^{(s)}} h(y \mid x)\, \nu_A(x) = \nu_B(y) \quad \text{for } y \in X_B^{(s)}, \qquad (5.5)$$

is automatically satisfied by summing the s equations in expression (5.4). Furthermore, we can
immediately define concepts of marginal refinement with respect to a partition I. The concept
of conditional refinement given the set $I_h$, which also appears to be immediate (in a definitional
sense), is, however, problematic as it involves conditioning of the vector x on $\theta \in I_h$. These
conditional predictions have no operational meaning, because we cannot define them only in
terms of the probability distribution $\nu$. Similarly, the concept of partial refinement on the
subset $X_0^{(s)} \subset X^{(s)}$ also is problematic since two different forecasters typically place different
amounts of subjective probability on the set $X_0^{(s)}$.
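
As a concrete check of the refinement condition, the following sketch verifies (5.4) and (5.5) for a candidate stochastic transformation h. The forecasters, the transformation, and the function names are hypothetical and chosen only for illustration: A uses three sharp predictions with equal frequency, and B is obtained by merging them into two blurrier predictions.

```python
from fractions import Fraction as F

def satisfies_5_4(nu_a, nu_b, h):
    """Check (5.4) componentwise; h maps (y, x) to h(y | x), missing pairs are 0."""
    s = len(next(iter(nu_a)))
    # h must be a stochastic transformation: for each x, h(. | x) sums to 1.
    if any(sum(h.get((y, x), 0) for y in nu_b) != 1 for x in nu_a):
        return False
    for y, p_y in nu_b.items():
        for i in range(s):
            lhs = sum(h.get((y, x), 0) * x[i] * p_x for x, p_x in nu_a.items())
            if lhs != y[i] * p_y:
                return False
    return True

def satisfies_5_5(nu_a, nu_b, h):
    """Check (5.5): h carries the distribution nu_A into nu_B."""
    return all(sum(h.get((y, x), 0) * p_x for x, p_x in nu_a.items()) == p_y
               for y, p_y in nu_b.items())

# Hypothetical well-calibrated forecasters with s = 3 outcomes.
x1 = (F(4, 5), F(1, 10), F(1, 10))
x2 = (F(1, 10), F(4, 5), F(1, 10))
x3 = (F(1, 10), F(1, 10), F(4, 5))
y1 = (F(17, 30), F(1, 3), F(1, 10))
y2 = (F(1, 10), F(1, 3), F(17, 30))
nu_a = {x1: F(1, 3), x2: F(1, 3), x3: F(1, 3)}
nu_b = {y1: F(1, 2), y2: F(1, 2)}

# x1 is always reported as y1, x3 as y2, and x2 is split evenly between them.
h = {(y1, x1): 1, (y1, x2): F(1, 2), (y2, x2): F(1, 2), (y2, x3): 1}

print(satisfies_5_4(nu_a, nu_b, h), satisfies_5_5(nu_a, nu_b, h))  # True True
```

Exact rational arithmetic is used so that the equalities can be tested without rounding tolerance. Finding such an h when none is supplied is a separate feasibility problem that this sketch does not address.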

At any rate, Theorem 2 and Corollary 1 from Section 2 carry over directly from the binary
case, i.e., forecaster A is sufficient for forecaster B if and only if there exists an appropriate
stochastic transformation, $h(x \mid y)$. Moreover, suppose we define a multivariate scoring rule,
$g(x) = [g_1(x), \ldots, g_s(x)]$. If the forecaster's actual subjective probability is p and he makes
the prediction x, his expected score is

$$\sum_{i=1}^{s} p_i\, g_i(x). \qquad (5.6)$$

The scoring rule is strictly proper if expression (5.6) is maximized if and only if x = p. Then
there is a direct multivariate analogue to Theorem 9 of Section 4, i.e. every strictly proper
scoring rule can be partitioned into two components, one of which is zero if the forecaster is
well-calibrated and the other of which is a measure of refinement giving a total ordering for
well-calibrated forecasters.
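
To make the definition of a strictly proper multivariate rule concrete, the sketch below evaluates expression (5.6) over a grid on the simplex. It assumes, for illustration, the multivariate quadratic rule $g_i(x) = 2x_i - \sum_j x_j^2$, which is known to be strictly proper, and contrasts it with the linear rule $g_i(x) = x_i$, which is not proper. The subjective probability p and the grid resolution are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

def quadratic_rule(x):
    """g_i(x) = 2 x_i - sum_j x_j^2 (a strictly proper rule)."""
    sq = sum(v * v for v in x)
    return [2 * v - sq for v in x]

def linear_rule(x):
    """g_i(x) = x_i (not proper: it rewards overstating the likeliest outcome)."""
    return list(x)

def expected_score(p, x, rule):
    """Expression (5.6): sum_i p_i g_i(x)."""
    return sum(pi * gi for pi, gi in zip(p, rule(x)))

def simplex_grid(s, n):
    """All vectors with coordinates k/n that sum to 1."""
    return [tuple(F(k, n) for k in ks)
            for ks in product(range(n + 1), repeat=s) if sum(ks) == n]

p = (F(1, 2), F(3, 10), F(1, 5))   # hypothetical subjective probabilities
grid = simplex_grid(3, 10)

best_quadratic = max(grid, key=lambda x: expected_score(p, x, quadratic_rule))
best_linear = max(grid, key=lambda x: expected_score(p, x, linear_rule))
print(best_quadratic == p)   # True: honest reporting maximizes (5.6)
print(best_linear)           # the vertex (1, 0, 0), not p
```

For the quadratic rule, expression (5.6) equals $\sum_i p_i^2 - \sum_i (x_i - p_i)^2$, which explains why the maximizer on the grid is unique and equal to p.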

The  following  results  are  also  as  expected: 

Theorem 12. If a multivariate forecaster A is well-calibrated, (i) A is also
marginally well-calibrated with respect to all possible proper partitions I of
$\{1, 2, \ldots, s\}$, (ii) A is conditionally well-calibrated given the set $I_h \subset \{1, 2, \ldots, s\}$, and
(iii) A is partially well-calibrated on all proper subsets $X_0^{(s)} \subset X^{(s)}$.

Theorem 13. If A and B are well-calibrated multivariate forecasters, and A is at
least as refined as B, then A is also marginally at least as refined as B with respect
to all possible proper partitions I.

We have, as yet, been unable to provide a collection of refinement conditions for dichotomies
which imply multivariate refinement. Nor have we been able to prove a directly verifiable set
of conditions analogous to Theorem 5 of Section 3. We can, however, give multivariate
versions of Theorems 1 and 4 by reformulating results of Blackwell (1951, 1953), Sherman
(1951), Stein (in unpublished lecture notes), and Strassen (1965).

Theorem 14. Consider two well-calibrated forecasters A and B. Then A is at least as
refined as B if and only if there exist discrete random variables X and Y, defined
on the (s-1)-dimensional simplex, such that the marginal probability distribution of
X is $\nu_A$, the marginal probability distribution of Y is $\nu_B$, and $E(X \mid Y) = Y$.


Theorem 15. Consider two well-calibrated forecasters A and B. Then A is at least as
refined as B if and only if, for every continuous convex function c(x) defined on
the (s-1)-dimensional simplex,

$$\sum_{x \in X_A^{(s)}} c(x)\,\nu_A(x) \ \ge\ \sum_{x \in X_B^{(s)}} c(x)\,\nu_B(x). \qquad (5.7)$$

We note that expression (5.7) in Theorem 15 is, in our problem, the same as the condition
used by Fishburn and Vickson (1978) for their definition of multivariate second-degree
stochastic dominance. They also suggest the application of standard feasibility tests of linear
programming to determine the existence of the stochastic transformation which we use to
define refinement.
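
The convex-function criterion can be probed numerically, with one caveat: checking a finite list of convex functions gives only a necessary condition, so a failure rules refinement out but a pass does not establish it. The sketch below uses hypothetical forecasters at the two extremes, a forecaster A who always predicts a vertex of the simplex and a forecaster B who always predicts the base rate, together with three convex functions chosen for illustration.

```python
from fractions import Fraction as F

def expected_value(c, nu):
    """Sum of c(x) nu(x) over a forecaster's predictions, as in (5.7)."""
    return sum(c(x) * p for x, p in nu.items())

def passes_convex_checks(nu_a, nu_b, convex_functions):
    """Necessary condition only: (5.7) must hold for each supplied c."""
    return all(expected_value(c, nu_a) >= expected_value(c, nu_b)
               for c in convex_functions)

# Three convex functions on the simplex (illustrative choices).
def squared_norm(x):
    return sum(v * v for v in x)

def largest_coordinate(x):
    return max(x)

def first_two_gap(x):
    return abs(x[0] - x[1])

checks = [squared_norm, largest_coordinate, first_two_gap]

# Hypothetical base rate; A predicts vertex i with the base-rate frequency of
# outcome i, while B always predicts the base rate itself.
mu = (F(1, 2), F(3, 10), F(1, 5))
nu_a = {(F(1), F(0), F(0)): mu[0],
        (F(0), F(1), F(0)): mu[1],
        (F(0), F(0), F(1)): mu[2]}
nu_b = {mu: F(1)}

print(passes_convex_checks(nu_a, nu_b, checks))   # True
print(passes_convex_checks(nu_b, nu_a, checks))   # False: B is not as refined as A
```

Here the inequality for A against B is an instance of Jensen's inequality, since both forecasters have the same mean prediction.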


Furthermore,  we  have  the  following  direct  multivariate  extension  of  a  result  presented  in 
DeGroot  and  Eriksson  (1983). 


Theorem 16. Consider two well-calibrated forecasters A and B. Then A is at least as
refined as B if and only if there exists a stochastic transformation $\eta$ such that

$$\sum_{x \in X_A^{(s)}} x^{T}\,\eta(x \mid y) = y^{T} \quad \text{for } y \in X_B^{(s)}. \qquad (5.8)$$

Proof: Suppose that A is at least as refined as B, and let h be a stochastic transformation
satisfying (5.4). If we define

$$\eta(x \mid y) = \frac{h(y \mid x)\,\nu_A(x)}{\nu_B(y)} \qquad (5.9)$$

whenever $\nu_B(y) > 0$, and define $\eta(x \mid y)$ arbitrarily if $\nu_B(y) = 0$, then (5.8) follows directly
from (5.4). Conversely, suppose that (5.8) is satisfied for some $\eta$ and define the stochastic
transformation h by (5.9). [Note that $h(y \mid x)$ may be defined arbitrarily if $\nu_A(x) = 0$.] Then
(5.4) follows directly from (5.8). ■
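
The converse direction of the proof can be traced on a small example. The sketch below starts from a hypothetical forecaster B and a hypothetical transformation $\eta$ whose conditional means reproduce B's predictions, as (5.8) requires, then recovers h by rearranging (5.9) as $h(y \mid x) = \eta(x \mid y)\,\nu_B(y)/\nu_A(x)$ and confirms (5.4). One assumption of the sketch should be stated plainly: $\nu_A$ is taken to be the distribution obtained by applying $\eta$ to $\nu_B$, which is what makes the recovered h sum to one over y.

```python
from fractions import Fraction as F

def spread_mean(eta, y, support_a):
    """Left side of (5.8): componentwise sum over x of x * eta(x | y)."""
    return tuple(sum(x[i] * eta.get((x, y), 0) for x in support_a)
                 for i in range(len(y)))

def recover_h(eta, nu_b, support_a):
    """Build nu_A and h(y | x) = eta(x | y) nu_B(y) / nu_A(x), i.e. (5.9) inverted."""
    nu_a = {x: sum(eta.get((x, y), 0) * p for y, p in nu_b.items())
            for x in support_a}
    h = {(y, x): eta[(x, y)] * nu_b[y] / nu_a[x]
         for (x, y) in eta if nu_a[x] > 0}
    return nu_a, h

def satisfies_5_4(nu_a, nu_b, h):
    """Check (5.4) componentwise with exact arithmetic."""
    return all(sum(h.get((y, x), 0) * x[i] * p_x for x, p_x in nu_a.items())
               == y[i] * p_y
               for y, p_y in nu_b.items() for i in range(len(y)))

# Hypothetical forecaster B with two predictions (s = 3 outcomes).
y1 = (F(1, 2), F(1, 4), F(1, 4))
y2 = (F(1, 4), F(1, 4), F(1, 2))
nu_b = {y1: F(1, 2), y2: F(1, 2)}

# eta spreads y1 evenly over xa and xb (whose average is y1) and leaves y2 alone.
xa = (F(3, 4), F(1, 4), F(0))
xb = y2
eta = {(xa, y1): F(1, 2), (xb, y1): F(1, 2), (xb, y2): F(1)}

nu_a, h = recover_h(eta, nu_b, [xa, xb])
print(all(spread_mean(eta, y, [xa, xb]) == y for y in nu_b))  # True: (5.8) holds
print(satisfies_5_4(nu_a, nu_b, h))                           # True: (5.4) follows
```

In this example the recovered transformation has $h(y_1 \mid x_a) = 1$, $h(y_1 \mid x_b) = 1/3$, and $h(y_2 \mid x_b) = 2/3$.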


A stochastic transformation $\eta$ satisfying expression (5.8) is known in the economics literature
as a mean-preserving spread (see, e.g., Rothschild and Stiglitz, 1970, 1973).

An interesting version of the multivariate setting results when the probability of outcome i
given that the forecaster predicts x depends only on the forecaster's subjective probability $x_i$
for outcome i. (This is clearly true if the forecaster is well-calibrated.) We say that the
forecaster is local if

$$\rho_i(x) = \rho_i(x_i), \quad i = 1, 2, \ldots, s. \qquad (5.10)$$

If the functions $\rho_i(\cdot)$ are monotonically increasing, then the forecaster is marginally
semi-calibrated, in the sense of Section 3. A special case of locality is linearity, and an interesting
question arises: Under what conditions on $X^{(s)}$ and the $\rho_i$'s is a multivariate forecaster being
local equivalent to

$$\rho_i(x) = b_0\,x_i + b_i, \quad i = 1, 2, \ldots, s, \qquad (5.11)$$

where $b_0, b_1, \ldots, b_s \ge 0$, and $b_0 + b_1 + \cdots + b_s = 1$? If the functions $\rho_i$ are known to be
continuous on the entire simplex, then it can be shown that they must be linear for any local
forecaster.
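
A small numerical illustration of the linear form (5.11) may help. In the sketch below the coefficients $b_0$ and $b_i$ and the prediction x are hypothetical. The code confirms that the resulting frequencies form a probability vector and that each $\rho_i$ depends on x only through $x_i$; it also computes the single prediction at which such a forecaster (with $b_0 < 1$) happens to be calibrated, namely $x_i = b_i / (1 - b_0)$.

```python
from fractions import Fraction as F

def linear_rho(x, b0, b):
    """Expression (5.11): rho_i(x) = b0 * x_i + b_i."""
    return tuple(b0 * xi + bi for xi, bi in zip(x, b))

# Hypothetical coefficients with b0 + b1 + ... + bs = 1.
b0, b = F(1, 2), (F(1, 4), F(1, 8), F(1, 8))
assert b0 + sum(b) == 1

x = (F(1, 2), F(3, 10), F(1, 5))
rho_x = linear_rho(x, b0, b)
print(rho_x, sum(rho_x))   # frequencies (1/2, 11/40, 9/40), which sum to 1

# The only prediction with rho(x) = x solves x_i = b0 * x_i + b_i.
fixed = tuple(bi / (1 - b0) for bi in b)
print(linear_rho(fixed, b0, b) == fixed)   # True
```

With $b_0 = 1$ and every $b_i = 0$ the map reduces to $\rho(x) = x$, the well-calibrated case noted in the text.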

Suppose we now say that Forecaster A dominates Forecaster B on the outcome i if the
marginal distribution of the ith prediction component for forecaster A given that outcome i
occurs is stochastically larger than the corresponding marginal distribution for B. We know
from Theorem 7 of Section 3 that, if multivariate forecasters A and B are both well-calibrated
and A dominates B on all s outcomes, then A is marginally at least as refined as B with
respect to each possible outcome. An open question is whether it is possible to use calibration,
locality, linearity as in (5.11), or semi-calibration in connection with some version(s) of
dominance to imply that one forecaster is sufficient for (or more refined than) another in our
full multivariate sense.